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The complexity of an organism is only weakly 
linked with its number of genes. Homo sapiens has 
about 25000 genes and the roundworm C. elegans 
about 19000 despite the different level of com- 

plexity. Not only the gene numbers are similar, the 
genes themselves are frequently shared across species. 
Even distantly related organisms have a high fraction 
of genes which stem from their common ancestor (or- 
thologs): more than 90% of genes are shared between 
human and mouse and at least 30% of genes of the 
yeast S. cerevisiae have orthologs in human [3]. 

This result is an important outcome of the recent 
genome sequencing projects. It has put the spotlight 
on the interactions between genes: changes in the 
complex networks of gene regulation, or in the in- 
teractions between proteins, may be a major cause 
of phenotypic variation, more so than changes in the 
genes themselves 0]- The molecular basis of these 
interactions includes specific binding sites on regula- 
tory DNA and binding domains in proteins. Binding 
sites can change quickly generating new interactions 
or deleting old ones |51 EJ E] • 

The resulting interest in biological interactions has 
been matched by the development of novel experi- 
mental techniques to measure protein-DNA interac- 
tions and protein-protein interactions. In particular, 
high-throughput methods have been developed, facil- 
itating measurements on a genome-wide scale rather 
than for individual genes. Some of the ingenious 
methods for experimentally determining biological 
interactions will be briefly reviewed in the next sec- 
tion. 



This experimental development is akin to the tran- 
sition from sequencing small parts of the DNA of an 
organism to the determination of full genomes. The 
growth of sequencing capabilities has been driving 
the development of computational methods for se- 
quence analysis for the past three decades. Virtually 
all methods for sequence analysis rely on statistics as 
a tool to infer function. Examples are the detection 
of genes, or of regulatory modules, or the identifi- 
cation of correlations between evolutionarily related 
sequences [5]. 

The corresponding development of computational 
network biology is still in its infancy. It will require 
new tools to address specific issues of biological net- 
works. These are characterized by a peculiar inter- 
play of stochasticity and function, and in many ways 
epitomize our current lack of understanding of biolog- 
ical systems. With this caveat, the point of view we 
take in this article is that statistics will again play a 
decisive role in our understanding of network biology, 
and we point out some currently available links be- 
tween network statistics and function. The merit of 
a statistical approach may not seem obvious from an 
engineering perspective, where networks are seen as 
deterministic processing machines producing a well- 
defined input-output relation. Indeed, biological net- 
works sometimes work in a surprisingly deterministic 
way: for example, a network of a few dozen major 
genes generates a well-defined spatiotemporal devel- 
opment pattern in the cukaryotic embryo. However, 
the underlying network structures are fundamentally 
stochastic, since they arise from the manifold tinkcr- 
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ing and feedback processes of biological evolution. 
Explaining deterministic function from a stochastic 
evolution requires a statistical, dynamical theory. 

One important aspect of this challenge is to pre- 
dict different functional units in networks. Differ- 
ent functions are reflected in a different evolutionary 
dynamics, and hence in different statistical charac- 
teristics of network parts. In this sense, the global 
statistics of a biological network, e.g., its connectivity 
distribution, provides a background, and local devi- 
ations from this background signal functional units. 
In the computational analysis of biological networks, 
we thus typically have to discriminate between dif- 
ferent statistical models governing different parts of 
the dataset. The nature of these models depends on 
the biological question asked. We illustrate this ra- 
tionale here with three examples: identification of 
functional parts as highly connected network clusters, 
finding network motifs, which occur in a similar form 
at different places in the network, and the analysis of 
cross-species network correlations, which reflect evo- 
lutionary dynamics between species. 

1. Measuring biological networks 

A wide range of experimental methods has been de- 
veloped to measure interactions between proteins, in- 
teractions between proteins and regulatory DNA, and 
expression levels of genes. Only a brief review is pos- 
sible here. 

In the yeast-two-hybrid (Y2H) method, the pair- 
wise interaction between two proteins is tested by 
creating two fusion proteins O ne protein is con- 
structed with a DNA-binding domain attached to its 
end, and its potential binding partner is fused to an 
activation domain. If the two proteins interact, the 
binding will form a transcriptional activator (gener- 
ally consisting of a DNA-binding domain and an ac- 
tivation domain) . The presence of an intact activator 
leads to the transcription of an easily detectable re- 
porter gene. (The reporter gene may for instance pro- 
duce a fluorescent protein.) In principle, the amount 
of the reporter gene produced can serve as a mea- 
sure of the affinity between the two proteins. The 
Y2H-method has been used to measure the protein 
interaction networks of yeast C. elegans 




Figure 1: Deviation from a uniform global statis- 
tics in biological networks, a) A network cluster is 
distinguished by an enhanced number of intra-cluster in- 
teractions, b) A network motif is a set of subgraphs with 
correlated interactions. In a limiting case, all subgraphs 
have the same topology, c) Cross-species correlations 
characterize evolutionarily conserved parts of networks. 

D. melanogaster ancl human The Y2H- 

datasets are known to contain a large number of false 
positive and false negative results. False negatives 
arise when the fusion proteins fail to localize in the 
yeast nucleus, or fail to fold properly once the new 
domains are attached. False positives may be linked 
to high expression levels of the hybrid in yeast, which 
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are never reached in vivo. 

Alternative approaches include pull-down assays, 
where one protein type is immobilized on a gel, and 
'pulls down' binding partners from a solution. Bind- 
ing partners may then be identified by various tags. 
Mass spectrometry is also used to identify the in- 
teracting protein pairs identified by such an affinity 
analysis While more accurate than the Y2H- 

method, these approaches have not yet been scaled 
up to provide high throughputs. 

Binding of proteins, specifically transcription fac- 
tors, to regulatory DNA has long been investigated 
by electrophoresis, where the motility of a DNA- 
fragment is altered by a protein bound to it. Chro- 
matin immunoprecipitation (ChIP) is an alternative 
procedure, which uses specific antibodies to isolate 
a protein and then amplifies DNA that may have 
been isolated together (co-precipitated) with the pro- 
tein. By running many such experiments in parallel 
on a microarray, this method can be scaled up to high 
throughputs (ChlP-on-chip, [T5]V 

Gene expression levels can be measured on DNA- 
microarrays, densely packed samples of known nu- 
cleotides, each a few tens of base-pairs long. Cur- 
rently more than fO 6 such samples, or probes, can 
be placed on a single microarray. The array is then 
washed with a fluorescently labeled sample. Binding 
of DNA in the sample to complementary DNA on the 
probe can be detected under a microscope from the 
resulting fluorescence pattern. Genome-wide expres- 
sion levels can thus be measured on a single array. 
Many other applications of microarrays are being de- 
veloped — for instance microarrays to measures in- 
teractions between transcription factors and regula- 
tory DNA. DNA-microarrays are also making major 
inroads as diagnostic tools, from characterizing the 
microbial communities in dentistry |16) to the early 
detection of cancer |17) . 

2. Random networks in biology 

Randomly generated networks are very useful to an- 
alyze simple characteristics of biological networks. 
For instance, typical distances on a randomly gen- 
erated network generally scale logarithmically with 
the number of network nodes. Finding such short 



distances also in biological network data is therefore 
not a surprising result and does not require a bio- 
logical explanation. Another frequent observation in 
biological networks is a distribution of node connec- 
tivities with a broad tail, which is shared by specific 
ensembles of random networks. This has motivated 
a number of statistical models explaining the connec- 
tivity distribution in terms of the underlying evolu- 
tionary dynamics ^0 1191 I2U| . Thus, ensembles of 
random networks can be tuned to fit certain charac- 
teristics of biological network data. Does that mean 
the actual network is random? This is clearly not the 
case: other observables may differ from what is ex- 
pected in the random network ensemble, and we will 
see that these deviations from the "null model" are 
particularly interesting as signals of biological func- 
tion. Hence, random network ensembles play an im- 
portant role in quantifying the most unbiased back- 
ground statistics of a "functionless" network. Their 
choice is a subtle issue: it has to be motivated by 
what we consider as not important for the biological 
function in question. Let us now turn to a few such 
models. 

A network is specified by its adjacency matrix a = 
(aui). For binary networks aw = 1 if there is a link 
between nodes i and i', and aw = if there is no 
link. Networks with undirected links are represented 
by a symmetric adjacency matrix. The in and out 
connectivities of a node, kf = am and kj = 
a w j are defined as the number of in- and outgoing 
links, respectively. The total number of directed links 
is given by K = £\ i( aw . 

To focus on a specific part of the network we de- 
fine an ordered subset A of n nodes {r%, . . . r n } (see 
Fig. ^l). The subset A induces a pattern a(A) on 
the network, represented by the restricted adjacency 
matrix containing only links internal to node subset 
A. a is thus an n x n matrix with entries ay ~ a rirj 
(i,j = 1, . . . , n). Together, subset of nodes A and its 
pattern a(„4) form a subgraph. 

The simplest ensemble of random networks is gen- 
erated by connecting all pairs of nodes independently 
with the same probability w. Given a subset of 
nodes A, the probability of generating pattern a is 
then given by -Po(a) = IIiVe.4(l - w) 1 ^' 1 "' w a "' (for 
undirected networks the sum is restricted to i < i'). 
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This well-known ensemble, named after the pioneers 
of graph theory P. Erdos and A. Renyi, leads to a 
Poissonian distribution of connectivities. The only 
free parameter of the Erdos-Renyi (ER) model, the 
link probability w between a given pair of nodes, can 
be tuned so that typical graphs taken from the ER- 
ensemble have the same number of links as the empir- 
ical data. If the subset of nodes A contains all n = N 
nodes of the network, w = K/N 2 . Considering con- 
nected subgraphs with n < N , w will in general be 
higher than K/N 2 . Then the value of w can be deter- 
mined by generating all connected subgraphs of size n 
from the empirical dataset and choosing w such that 
the average number of links in the ER model equals 
the average number of links in connected subgraphs 
in the data. 

However, in biological networks the connectivity 
distribution often differs markedly from that of the 
Erdos-Renyi-model. If we have reasons to assume 
that a biological function is not tightly linked to 
connectivity at the level of individual nodes, we 
should include the connectivity distribution in our 
null model. Indeed, we can easily construct a ran- 
dom ensemble matching the connectivity distribution 
of the dataset. In this ensemble, the probability tun* 
of finding a link between a pair of nodes i, i' depends 
on the connectivities of the nodes. Assuming links be- 
tween different node pairs to be uncorrelated, a given 
subset of nodes A has a pattern a with probability 

n 

F (a)= J] (l-Wi^-^'wy. (1) 

For n — N, when A includes the entire network, the 
probability of finding a directed link between nodes i 
and i' is approximately ww = k~k^ r /K, that of an 
undirected link wa< = k ri k r .,/K 21. If we further- 
more impose the constraint that the null model de- 
scribe the statistics of a connected dataset, the proba- 
bilities in (|TJ are increased by a factor that can be de- 
termined from the data as described above. The null 
model constructed in this way is maximally unbiased 
with respect to all patterns in the dataset beyond its 
connectivity distribution. 



3. Network clusters 

A first trace of functionality in biological networks 
are strong inhomogeneities in their link statistics, 
which are not captured by the null model. Exam- 
ples are protein aggregates of several proteins held 
together by mutual interactions, which show up as 
highly connected clusters in protein interaction net- 
works, and sets co-regulated genes (for instance by 
an oncogene 22 ) , leading to clusters in co-expression 
networks. How can we identify these clusters statis- 
tically? 

Clusters are subgraphs with a significantly in- 
creased number of internal links compared to the 
background of the network, see Figure^,). The fea- 
ture distinguishing clusters is the number of internal 
links, 

n 

L{k) = ««' • ( 2 ) 

i,i'eA 

The statistics of clusters is then described by an en- 
semble 

Q CT (a) = Z- 1 exp[<7L(a)] P (k) (3) 

of the same form as |JT]|. but with a bias towards a 
high number of internal links. The average number 
of internal links is determined by the value of the 
link reward a. We have introduced the normalization 
factor Z a = Y\Z'J2a u ,=o,i ex P{ ljL (^ P o( k )' which 
ensures that Q a (a) summed over all patterns a gives 
unity. 

Is a given pattern a more likely part of a cluster as 
described by the model ©, or is it more likely part 
of the background described by the null model 
To address this question, we define the so-called log- 
likelihood score 

S(A, a) = log (fj#) = <?L{k{A)) - log Z a . (4) 

A positive score results if it is more likely for the 
pattern a(„4) to arise in the model describing clus- 
ters than in the alternative null model. High scores 
indicate strong deviations from the null model. Of 
course this an attractive property for the algorithmic 
search for deviations from the null model. As shown 
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in the appendix, the form of the score (QJ is related in 
a simple way to the probability that pattern a comes 
from the model describing clusters. 

Patterns with a high score QJ are bona fide clus- 
ters. The first term of the score weighs the total num- 
ber of links. As expected, a pattern with many inter- 
nal links yields a high score. The second term acts as 
a threshold and assigns a negative score to a pattern 
with a too small number of internal links. This term 
takes into account the connectivities of the nodes: 
highly connected nodes have more internal links al- 
ready in the null model. Node subsets with highly 
connected nodes tend to give lower scores. The score 
(@J thus goes beyond simple measures of clustering, 
such as the number of internal links, and provides a 
statistical basis for cluster detection. 

Given the scoring parameter a, the maximum-score 
node subset A* (a) is defined by 

A* {a) = argm&x A S (A, a) . (5) 

At this point, the scoring parameter a is a free pa- 
rameter, whose value needs to be inferred from the 
data. This can be done by applying the principle 
of maximum likelihood: a is determined by the re- 
quirement that the model describing clusters J3J opti- 
mally describes the statistics of the maximum-score 
pattern. For a given pattern a, the optimal fit is 
defined by the so-called maximum likelihood value 
a* = argmax .Q( T (a(^l)), which maximizes the likeli- 
hood of generating pattern a(.4) under the model © . 
Since log(a;) is a monotonously increasing function, 
the maximum likelihood value a* coincides with the 
maximum of the log-likelihood score (0} over a. The 
maximum-score node subset at the optimal scoring 
parameter is then determined by the joint maximum 
of the score over A and a 

S{A*,<t*) = max S( A* (a), a) = maxS(A,a) . (6) 

<y A.<j 

One can easily show that the maximum-likelihood 
value of tr sets the expected number of links in the 
ensemble Q a * equal to the actual number of links in 
pattern a*: setting the derivative of with respect 
to a equal to zero gives 

<L(a)>Q a . =i(a*). (7) 




,SN03 
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Figure 2: Scoring clusters in protein interaction 
networks, a) The score S of the maximum-score node 
subset A*{o) is shown as a function of the scoring param- 
eter a. The dotted lines indicate the values of a where 
the maximum-score node subset changes. The maximum 
of the score with respect to a indicates the optimal scor- 
ing parameter a* — 6.6. The grey region 4.25 < a < 7 
indicates the values where A*(cr) = A* (a*), b) The 
maximum-score subgraphs for a < 4.25, 4.25 < a < 7, 
7 < a < 11,(7 > 11 (left to right). The subgraph result- 
ing from the optimal scoring parameter is highlighted in 
grey. The maximum-score subgraphs for 7 < a < 11 and 
for a > 11 are distinguished by the connectivities of their 
nodes with the latter having a higher average connectiv- 
ity. This accounts for the former having a higher score for 
7 < a < 11 despite the smaller number of internal links. 

Clusters in protein interaction networks. We 

use the scoring function (0J to identify clusters in the 
protein interaction network of yeast, namely the high- 
throughput dataset of Uetz et al. (TJ3- At a given 
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value of the scoring parameter a, the maximum-score 
node subset A* (a) is identified using a simple Monte- 
Carlo algorithm. At different values of a, different 
node subsets A* (a) yield the highest score (compared 
to all other node subsets). The resulting subgraphs 
are shown in Fig. . At low values of a, subgraphs 
with many nodes, but comparatively few internal in- 
teractions per node yield the highest score. At high 
values of a, subgraphs with many internal interac- 
tions are favored. However these subgraphs tend 
to be small. The interplay between subgraph size 
and internal connectivity leads to a joint score maxi- 
mum over A and a at the optimal scoring parameter 
a* = 6.6, see Fig. Et). 

The maximum-score cluster A* = A* (a*) consists 
of the proteins SNZ1,SNZ2,SN01,SN03, and SN04, 
highlighted in grey in Fig. 0b). The proteins in this 
cluster have a common function; they are involved in 
the metabolism of pyridoxine and in the synthesis of 
thiamin |23I23I- Furthermore, SNZ1 and SNOl have 
been found to be co-regulated and their mRNA levels 
increase in response to starvation for aminoacids A, 
Ura, and Trp [23]. 

4. Network motifs 

The topology of a subgraph may be associated with 
a specific function. A possible example is a feed- 
forward loop acting as a high-frequency filter in a 
regulatory network |26|. If such a function is required 
repeatedly in different parts of the network, there is 
selection pressure for the creation and maintenance 
of similar topologies in different parts of a network. 
Such network motifs |27U26| are families of subgraphs 
distinguished from the null model by mutual correla- 
tions between subgraphs, see Fig. Q]b). 

To quantify these correlations, we need to specify 
the parts of the network with correlated patterns. We 
define a graph alignment A by a set of several node 
subsets A a (a = 1, . . . ,p), each containing the same 
number of n nodes, and a specific order of the nodes 
{rf , . . . ,r°} in each node subset. An alignment as- 
sociates each node in a node subset with exactly one 
node in each of the other node subsets. The align- 
ment can be visualized by n "strings" , each connect- 
ing p nodes as shown in Fig. ^b) . 



An alignment specifies a pattern a Q = k(A a , A) 
in each node subset. For any two aligned subsets of 
nodes, A a and Ar, we can define the pairwise mis- 
match of their patterns 



E^'(i-4 



(i 



(8) 



The mismatch is a Hamming distance for aligned pat- 
terns. The average mismatch over all pairs of aligned 
patterns is termed the fuzziness of the alignment. 

Frequently network motifs also have an enhanced 
number of internal links [261 12 7| , providing the pos- 
sibility of feedback or other faculties not available to 
tree-like patterns. An ensemble describing p node 
subsets with correlated patterns a 1 , . . . , sP with an 
enhanced number of links is given by 



Wa 1 ,...,F) = 2-inft(a a ) 



(9) 



x exp 



2p 



j2 M( a a , a p ) + aj2m a ) 



a, /3=1 



a = l 



The parameter /i > biases the ensemble © 
towards patterns with small mutual mismatches 
M(a Q , a* 3 ). 

Given the null model £Q) and the model © with 
correlated patterns, we obtain a log-likelihood score 
for network motifs 



S(A,(j,,a) 
= log 



Po(a\...,aP) 



= -£ E W(4»,^)+<r££(S») 

P a, 0=1 a=l 



(10) 



High-scoring alignments A indicate bona fide network 
motifs. The first and second term reward alignments 
with a small mutual mismatch and a high number of 
internal links, respectively. The term log Z a ^ acts as 
a threshold assigning a negative score to alignments 
with too high fuzziness or too few internal links. 
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Again, both the alignment A and the scoring pa- 
rameters // and cr are a priori undetermined. For 
given scoring parameters, the maximum-score align- 
ment 

A*((j,,a) = &rgmax. A S(A, (j,,a) (11) 

occurs at some finite value of the number of sub- 
graphs p*(fi, cr). 

The scoring parameters fi and a can again be de- 
termined by maximum likelihood, which corresponds 
to maximizing the score S(A*(fJ,,a), fi,a) with re- 
spect to the scoring parameters. By differentiating 
(fTUfl with respect to the scoring parameters one finds 
that at fi = n* and a = a* the model (O fits the 
maximum-score network motifs: the expectation val- 
ues of the internal number of links and the fuzziness 
equal the corresponding values of the maximum-score 
alignment. 

Network motifs in regulatory networks. We 

now apply the scoring function 11U|) to the identifica- 
tion of network motifs in the gene regulatory network 
of E. coli, taken from '26; . A full account and a score- 
maximization algorithm are given in |28j. 

We first investigate the properties of the max- 
imal score alignment at fixed scoring parameters. 
Fig. |3 (a) shows the score S and the fuzziness M for 
the highest-scoring alignment with a prescribed num- 
ber p of subgraphs, plotted against p. The fuzziness 
increases with p, and the score reaches its maximum 
5*(cr, ^) at some value p*(<r,n). For p < p*(cr, ^) 
the score is lower, since the alignment contains fewer 
subgraphs and for p > p*(o~,fi) it is lower since the 
subgraphs have higher mutual mismatches. 

The optimal scoring parameters fi and a are again 
inferred by maximum likelihood. The resulting opti- 
mal alignment ^4* = ,4*(/i*, cr*) is shown in Fig. 0(b) 
using the so-called consensus motif 



S (o,M) - 



(12) 



a=l 



The consensus motif is a probabilistic pattern; the en- 
try a denotes the probability that a given binary link 
is present in the aligned subgraphs. The motif shown 
in Fig. 13(b) consists of 2 + 3 nodes forming an input 
and an output layer, with links largely going from the 
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fnr yhfA crp araC crp fnr 
fucPIKUR crp crp deoR 
rpoH himA glnALC fliAZY 
himA mdh himA crp 



acs prsA serA 
araBAD flhDC 
narK fucAO 
galETKM gltA 
tyrB ecfl 
ompR_envZ 
glnHPQ flhBAE 
ibpAB fpr 
glcDEFCB 
glpTQ purC uhpA 




(b) 



arcA adhE ansB araj caiF fdnCH 
metA galS dctA deoCABD sip 
ompF fhIA fliLMNOPQR narL 
fumC nupC glpACB maIXY ppsA 



idnDOTR nrdAB fnr crp hns 
narZYWV crp CaIR arcA cytR 
cpxAR envY_ompT 
rpsU_dnaC_rpoD flhDC fnr 
speA arcA glpR moaABCDE cytR 



aceBAK 
aldB dcuB_ 
firmB araE 
fixABCX caiF 
meIR mhpABCDFE 
focA_pflB nycA 
htrA oppABCDF 
fdhF flgBCDEFCHIJK 
focA_pflB zwf 
snxR glpD 
marRAB rpoN 



Figure 3: Motifs in the regulatory network of E. 
coli. (a) Score optimization at fixed scoring parameters 
a = 3.8 and /i = 4.0 for subgraphs of size n — 5. The total 
score S (thick line) and the fuzziness M (thin line) are 
shown for the highest-scoring alignment of p subgraphs, 
plotted as a function of p. (b) The consensus motif of 
the optimal alignment, and the identities of the genes 
involved. The alignment consists of 18 subgraphs sharing 
at most one node. The 5 grey values correspond to the 
consensus motif a in the range 0.1 — 0.2, 0.2 — 0.4, 0.4 — 0.6, 
0.6-0.8, 0.8-0.9. 



input to the output layer. Most genes in the input 
layer code for transcription factors or are involved in 
signaling pathways. The output layer mainly consists 
of genes coding for enzymes. 

5. Cross-species analysis of networks 

The motifs discussed above show correlation with- 
out sharing a common evolutionary history. Larger 
functional units may be distinguished by their evo- 
lutionary conservation. Thus, we expect parts of the 
network to maintain their topology and to form a 
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conserved core, while other parts show a more rapid 
turnover of both nodes and interactions, see Fig- 
ure n£). This conservation can be detected as topo- 
logical correlation across species. 

We assume that organisms evolve independently 
after speciation, leading to divergence in their net- 
work links as well as in the overall similarity of the 
nucleotide sequences, the structure of proteins, and 
the biochemical role of a metabolite. The relation- 
ship between link and node similarity is non-trivial: 
genes may retain their function and their interactions 
with other genes despite considerable sequence diver- 
gence. On the other hand, the change of a few nu- 
cleotides can create or destroy a binding site, imply- 
ing that genes with high overall sequence similarity 
may have entirely different interactions. Hence, the 
cross-species analysis has to take into account infor- 
mation from both links and nodes. 

A log-likelihood score assessing the link statistics of 
node subsets in network A and in network B follows 
directly from l|10|) . This link score is given by 

S e (A, n, cja,<jb) = —fJ,M(k, b) (13) 
+a (i(a) + L(b)) -logZ M , CT . 

To assess the similarity of nodes, we consider a 
measure Qij , which describes the similarity of node i 
in network A and node j in network B. The node sim- 
ilarity measure may be a percentage sequence iden- 
tity, or a distance measure of protein structures. The 
information on node similarity can be incorporated 
into the alignment score by contrasting a null model 
with a model describing a statistics where node simi- 
larity is correlated with the alignment. To construct 
the null model, we assume that node similarities 8ij 
for different node pairs i,j are identically and inde- 
pendently distributed and denote their distribution 
by Po(0ij). The model describing cross-species corre- 
lations has to take into account that the distribution 
of node similarities between aligned pairs of nodes fol- 
lows a different statistics (typically generating higher 
values of 9), denoted by q r i(0). The distribution of 
pairwise similarity coefficients between one aligned 
node and nodes other than its alignment partner is 
denoted by q%{6). 




HMGN1/Parp2 HMGN1/HMGN1 

a) b) 



As'(a,b) |a| 

-1.7 13 OS 1 

Figure 4: Cross-species network alignment shows 
conservation of gene clusters, (a) 7 genes from a clus- 
ter of co-expressed genes (circle) together with 7 random 
genes outside the cluster (straight line) . Each node repre- 
sents a pair of aligned genes in human and mouse. The 
intensity of a link encodes the correlation value a in hu- 
man. The color indicates the evolutionary conservation 
of a link, with blue hues indicating strong conservation. 
Th conservation is quantified by the excess link score con- 
tribution, As e , defined as the link score minus the aver- 
age link score of links with the same correlation value, 
(b) The same cluster, but with human-HMGNl "falsely" 
aligned to its ortholog mouse-HMGNl, with the red links 
showing the poor expression overlap of this pair of genes. 

Assuming that the statistics of links and nodes 
similarities are uncorrelated for a given alignment, 
a simple calculation analogous to yields the log- 
likelihood score 

S(A) = S e (A) + S n {A) , (14) 

with the information from node similarity contribut- 
ing a node score 

S n (A)=Y,Si(0ii)+ E s 2(%) (15) 
ieA i e A,j i= i 

3 eB,i£A 

and = log(tf(0)/pff(0)) and = 

log(#(0)/pS(0)). 

The scoring parameters entering l|14l) need to be 
determined from the data. Provided there are not 
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too many scoring parameters, this can again be done 
by maximum likelihood as outlined in the preceding 
sections. Particular examples are networks with bi- 
nary links and course-grained measures of sequence 
similarity. (As an extreme case, node similarity may 
be considered a binary variable, when nodes either 
have significant similarity or not. Then the ensem- 
bles describing the node statistics are each described 
by a single variable, see for details.) 

Alignment of co-expression networks. We com- 
pare co-expression networks of H. sapiens and M. 
musculus. In co-expression networks, the weighted 
link aw G [— 1, 1] between a pair of genes i,j is given 
by the correlation coefficient of their gene expression 
profiles measured on a microarray chip. Genes which 
tend to be expressed under similar conditions thus 
have positive links. The score l|13l) can easily be gen- 
eralized to weighted interactions, see |29| . 

The data of Su et al. (3U| was use d to construct net- 
works of ~ 2000 housekeeping genes. Human-mouse 
orthologs were taken from the Ensembl database j2H] ■ 
Details on the algorithm to maximize the score ()lrtf) 
are given in |29| . 

We focus on strongly conserved parts of the two 
networks. Figure 0] shows a cluster of co-expressed 
genes which is highly conserved between human and 
mouse (link conservation is shown in blue, changes 
between the links in red). 

With one exception, the aligned gene pairs in this 
cluster have significant sequence similarity and are 
thought to be orthologs, stemming from a common 
ancestral gene. The exception is the aligned gene 
pair human-HMGNl/mouse-Parp2. These genes are 
aligned due to their matching links, quantified by 
a high contribution to the link score Q13fl of S = 
25.1. The "false" alignment human-HMGNl/mouse- 
HMGN1 respects sequence similarity but produces a 
link mismatch (S £ = —12.4); see Fig. ^b). Human- 
HMGN1 is known to be involved in chromatin mod- 
ulation and acts as a transcription factor. The net- 
work alignment predicts a similar role of Parp2 in 
mouse, which is distinct from its known function in 
the poly(ADP-ribosyl)ation of nuclear proteins. The 
prediction is compatible with experiments on the ef- 
fect of Parp- inhibition, which suggest that Parp genes 



in mouse play a role in the chromatin modification 
during development |31| . 

6. Towards an evolutionary theory 

Different parts of biological networks have different 
functions. Here we have applied a statistical ap- 
proach to the detection of network clusters, network 
motifs, and cross-species correlations. But the detec- 
tion of deviations from a global background statistics 
has a wider perspective, which includes the connec- 
tion between different type of networks, the link be- 
tween network topology and the underlying sequence, 
and spatiotemporal changes of biological networks. 
From an evolutionary point of view, these devia- 
tions are created and maintained by selection pres- 
sures which are both non-homogeneous and corre- 
lated across the network. A quantitative theory of 
biological networks will thus require a synthesis of 
network statistics and population genetics, a largely 
outstanding task to date. Here we give a brief outlook 
on some of the challenges ahead. 

Genetic interactions between different links. 

Biological function is typically tied to modules con- 
sisting of several nodes and links. As a result, there 
are correlations between links across different species: 
a species with a certain function will tend to have all 
links associated with the specific function, a species 
lacking the function will tend to have none of the 
corresponding links. The network motifs discussed 
above are only a special case of this phenomenon. 
With data on biological networks becoming avail- 
able for an increasing number of species, it will be- 
come feasible to infer these correlations and the cor- 
responding functional modules from data. Scoring 
functions constructed to detect genetic interactions 
in multiple alignments will play an important role in 
this undertaking. 

Gene duplications. Following the duplication of 
a gene, the daughter genes have the same function 
and same interactions with other genes. Indepen- 
dent evolution of the two genes may lead to the 
non-functionalisation, and even the loss of one of 
the duplicates, or to subfunctionalisation, with dif- 
ferent functional roles being divided among the two 
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copies Tracing the dynamics of gene duplica- 

tion at the level of interaction networks gives insights 
into the evolutionary dynamics of networks [201 133) . 
Scoring for jointly conserved subgroups of links can 
be used to identify the different functional modules a 
gene is involved in. This can be done both at the level 
of single species, as well as in a cross-species analysis, 
where gene duplications introduce one-to-many and 
many-to-many alignments. 

Neutral and selective dynamics. Biological net- 
works show a great deal of plasticity, since the same 
biological function can be carried out by different net- 
works (see e.g. H3|). This flexibility leads to neutral 
evolution as a population explores the space of net- 
works corresponding to a given function. On the 
other hand, networks may change as a new func- 
tionality is acquired, or because of changing envi- 
ronmental conditions. Disentangling neutral moves 
and changes under selection is possible by contrast- 
ing inter-species variability with intra-species vari- 
ability 35 . Inferring the modes of network evolu- 
tion and the relative weights of neutral and selective 
dynamics remains an outstanding challenge for ex- 
periment and theory. 
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Appendix: 

Bayesian analysis of network data 

The detection of deviations from a null model can be 
formulated as a problem of deciding between alterna- 
tive hypotheses. The first hypothesis is that a given 
node subset follows the statistics of the null model. 
The alternative hypothesis is that the node subset fol- 
lows a statistics different from the null model. This 
statistics is called the Q-model. 

The choice between these two alternatives can be 
formulated probabilistically, by considering the poste- 
rior probability P{Q\k, A). It describes the probabil- 
ity that the node subset (s) specified by A follow the 
Q- model (hypothesis Q), rather than the null model 
(null- hypothesis Pq). Denoting any prior knowledge 



we may have about the probability with which the 
two alternatives occur by P(Q) and P{Po), respec- 
tively, one may use Bayes' theorem to find 



P(Q\k,A) 



(16) 



P{k\Q,A)P{Q) 
P{k\A) 

P(k\Q,A)P(Q) 

P(k\P , A)P(Po) + P(k\Q, A)P(Q) 

e S'(A) 



1 



,S'(A) 



P(k\Q,A) gives the probability of generating pat- 
terns a under the Q-model (given, for instance, by 
© or by ©). P(k\P ,A) gives the probability of 
generating the same pattern under the null model . 
The posterior probability is thus a monotonously in- 
creasing function of the log-likelihood score given by 



S'(A) 



log 



P(*\Q,A) 



P(k\P Q ,A) 
= S(A) + const. 



log 



P(Q) 

P(Po) 



(17) 



Hence the score S(A) defined in 0} has a sound the- 
oretical foundation: it is a measure of the posterior 
probability that the node subset specified by A fol- 
lows the Q-model rather than the null model. 

This simple picture needs to be extended when the 
parameters m of the Q-model and the alignment A 
are unknown and are considered "hidden" variables 
to be determined from the data. We construct a 
model for the entire network with adjacency matrix 
a, with pattern a(.4) following the Q- m °del, the re- 
mainder of the network following the null model 



P(a\A, m) = Q{k\A, m)P (k\A) 



(18) 



The matrix of links between nodes which are not both 
part of A is denoted by a. Using Bayes' theorem one 
can write the posterior probability of A and m, i.e. 
the conditional probability of the hidden variables, in 
the form 



P(A,m\a) 



Q{a\A,m)P{A, m) 
^2 A 0( a \A,m)P(A,m) 



(19) 



We assume the prior probability P(A, m) to be flat. 
Dropping the terms independent of A and m, the op- 
timal alignment A* is obtained by maximizing the 
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posterior probability Q(A\a) ~ J2 m Q( a IA m ) w i tn 
respect to A and similarly the optimal scoring param- 
eters m* by maximizing Q(m\a) ~ S^Q( a l^ m ) 
with respect to m. In the so-called Viterbi approx- 
imation, A* and m* are inferred by jointly maxi- 
mizing Q(a, b, &\A, m) with respect to A and m. 
Assuming the sum J2 A m Q(a\A, m) can be split 
into the term stemming from A*,m* and a remain- 
der Y,AfA*,m£m* Q( a \Am) ~ P (a), the posterior 
probability (|19|l can again be written in the form 
of l|17|) . In this approximation, the maximum-score 
alignment and the optimal scoring parameters are de- 
termined by the maximum of the log-likelihood score 
(|4*|l over the alignments and over the scoring param- 
eters. 
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